{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "03_collaborative_filtering_colab.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "toc_visible": true,
      "include_colab_link": true
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.7"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/mlelarge/dataflowr/blob/master/CEA_EDF_INRIA/03_collaborative_filtering_full_colab.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cKEoXRSzUUCv",
        "colab_type": "text"
      },
      "source": [
        "# Collaborative filtering\n",
        "-----\n",
        "\n",
        "In this example, we'll build a quick explicit feedback recommender system: that is, a model that takes into account explicit feedback signals (like ratings) to recommend new content.\n",
        "\n",
        "We'll use an approach first made popular by the [Netflix prize](http://www.netflixprize.com/) contest: [matrix factorization](https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf). \n",
        "\n",
        "The basic idea is very simple:\n",
        "\n",
        "1. Start with user-item-rating triplets, conveying the information that user _i_ gave some item _j_ rating _r_.\n",
        "2. Represent both users and items as high-dimensional vectors of numbers. For example, a user could be represented by `[0.3, -1.2, 0.5]` and an item by `[1.0, -0.3, -0.6]`.\n",
        "3. The representations should be chosen so that, when we multiplied together (via [dot products](https://en.wikipedia.org/wiki/Dot_product)), we can recover the original ratings.\n",
        "4. The utility of the model then is derived from the fact that if we multiply the user vector of a user with the item vector of some item they _have not_ rated, we hope to obtain a predicition for the rating they would have given to it had they seen it.\n",
        "\n",
        "![collaborative filtering](http://ampcamp.berkeley.edu/big-data-mini-course/img/matrix_factorization.png)\n",
        "source:[ampcamp.berkeley](http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OeHGO4qbUUC4",
        "colab_type": "text"
      },
      "source": [
        "## 1. Preparations"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "frj5rX9wUUC_",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "%matplotlib inline\n",
        "import matplotlib.pyplot as plt\n",
        "import numpy as np\n",
        "import os.path as op\n",
        "\n",
        "from zipfile import ZipFile\n",
        "try:\n",
        "    from urllib.request import urlretrieve\n",
        "except ImportError:  # Python 2 compat\n",
        "    from urllib import urlretrieve\n",
        "\n",
        "# this line need to be changed:\n",
        "data_folder = '/content/'\n",
        "\n",
        "ML_100K_URL = \"http://files.grouplens.org/datasets/movielens/ml-100k.zip\"\n",
        "ML_100K_FILENAME = op.join(data_folder,ML_100K_URL.rsplit('/', 1)[1])\n",
        "ML_100K_FOLDER = op.join(data_folder,'ml-100k')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pK7DyF_ZUUDQ",
        "colab_type": "text"
      },
      "source": [
        "We start with importing a famous dataset, the [Movielens 100k dataset](https://grouplens.org/datasets/movielens/100k/). It contains 100,000 ratings (between 1 and 5) given to 1683 movies by 944 users:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "tKfA7pagUUDT",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "if not op.exists(ML_100K_FILENAME):\n",
        "    print('Downloading %s to %s...' % (ML_100K_URL, ML_100K_FILENAME))\n",
        "    urlretrieve(ML_100K_URL, ML_100K_FILENAME)\n",
        "\n",
        "if not op.exists(ML_100K_FOLDER):\n",
        "    print('Extracting %s to %s...' % (ML_100K_FILENAME, ML_100K_FOLDER))\n",
        "    ZipFile(ML_100K_FILENAME).extractall(data_folder)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LKXSFZpUUUDl",
        "colab_type": "text"
      },
      "source": [
        "Other datasets, see: [Movielens](https://grouplens.org/datasets/movielens/)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hHgLg4jBUUDp",
        "colab_type": "text"
      },
      "source": [
        "## 2. Data analysis and formating"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZlGyPSBpUUDt",
        "colab_type": "text"
      },
      "source": [
        "[Python Data Analysis Library](http://pandas.pydata.org/)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "QFRB1aAPUUDy",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import pandas as pd\n",
        "\n",
        "all_ratings = pd.read_csv(op.join(ML_100K_FOLDER, 'u.data'), sep='\\t',\n",
        "                          names=[\"user_id\", \"item_id\", \"ratings\", \"timestamp\"])\n",
        "all_ratings.head()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "enyJuYhUUUEL",
        "colab_type": "text"
      },
      "source": [
        "Let's check out a few macro-stats of our dataset"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3M7-jinQUUEO",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#number of entries\n",
        "len(all_ratings)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ofVcE781UUEa",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "all_ratings['ratings'].describe()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "tCdV2qzMUUEk",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# number of unique rating values\n",
        "len(all_ratings['ratings'].unique())"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zU9vARwJUUEs",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "all_ratings['user_id'].describe()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "P1QoS71CUUE7",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# number of unique users\n",
        "total_user_id = len(all_ratings['user_id'].unique())\n",
        "print(total_user_id)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "l3gfokbNUUFH",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "all_ratings['item_id'].describe()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pE6-aj0KUUFY",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# number of unique rated items\n",
        "total_item_id = len(all_ratings['item_id'].unique())\n",
        "print(total_item_id)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ur4bjuniUUFj",
        "colab_type": "text"
      },
      "source": [
        "For spliting the data into _train_ and _test_ we'll be using a pre-defined function from [scikit-learn](http://scikit-learn.org/stable/)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZT5oxhoGUUFm",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from sklearn.model_selection import train_test_split\n",
        "\n",
        "ratings_train, ratings_test = train_test_split(\n",
        "    all_ratings, test_size=0.2, random_state=42)\n",
        "\n",
        "user_id_train = ratings_train['user_id']\n",
        "item_id_train = ratings_train['item_id']\n",
        "rating_train = ratings_train['ratings']\n",
        "\n",
        "user_id_test = ratings_test['user_id']\n",
        "item_id_test = ratings_test['item_id']\n",
        "rating_test = ratings_test['ratings']"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aLem0iAjUUFs",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "len(user_id_train)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "45HhGRbFUUF2",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "len(user_id_train.unique())"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "z9m6gSoKUUF_",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "len(item_id_train.unique())"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Vmx1YZmTUUGG",
        "colab_type": "text"
      },
      "source": [
        "We see that all the movies are not rated in the train set."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TOY6YCqPUUGI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "user_id_train.iloc[:5]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "6ZCYvrYhUUGP",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "item_id_train.iloc[:5]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qZsssRYSUUGW",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "rating_train.iloc[:5]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Q7JF-d05UUGc",
        "colab_type": "text"
      },
      "source": [
        "## 3. The model\n",
        "\n",
        "We can feed our dataset to the `FactorizationModel` class - a sklearn-like object that allows us to train and evaluate the explicit factorization models.\n",
        "\n",
        "Internally, the model uses the `Model_dot`(class to represents users and items. It's composed of a 4 `embedding` layers:\n",
        "\n",
        "- a `(num_users x latent_dim)` embedding layer to represent users,\n",
        "- a `(num_items x latent_dim)` embedding layer to represent items,\n",
        "- a `(num_users x 1)` embedding layer to represent user biases, and\n",
        "- a `(num_items x 1)` embedding layer to represent item biases."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "tsrfFi1QUUGd",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import torch.nn as nn\n",
        "import torch"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Vaf7-kdnUUGh",
        "colab_type": "text"
      },
      "source": [
        "Let's generate [Embeddings](http://pytorch.org/docs/master/nn.html#embedding) for the users, _i.e._ a fixed-sized vector describing the user"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "2iu-1X4AUUGi",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "embedding_dim = 3\n",
        "embedding_user = nn.Embedding(total_user_id, embedding_dim)\n",
        "input = torch.LongTensor([[1,2,4,5],[4,3,2,0]])\n",
        "embedding_user(input)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HpkKu7PVUUGp",
        "colab_type": "text"
      },
      "source": [
        "We will use some custom embeddings and dataloader"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "eB6_y1nMUUGq",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "class ScaledEmbedding(nn.Embedding):\n",
        "    \"\"\"\n",
        "    Embedding layer that initialises its values\n",
        "    to using a normal variable scaled by the inverse\n",
        "    of the emedding dimension.\n",
        "    \"\"\"\n",
        "\n",
        "    def reset_parameters(self):\n",
        "        \"\"\"\n",
        "        Initialize parameters.\n",
        "        \"\"\"\n",
        "\n",
        "        self.weight.data.normal_(0, 1.0 / self.embedding_dim)\n",
        "        if self.padding_idx is not None:\n",
        "            self.weight.data[self.padding_idx].fill_(0)\n",
        "\n",
        "\n",
        "class ZeroEmbedding(nn.Embedding):\n",
        "    \"\"\"\n",
        "    Used for biases.\n",
        "    \"\"\"\n",
        "\n",
        "    def reset_parameters(self):\n",
        "        \"\"\"\n",
        "        Initialize parameters.\n",
        "        \"\"\"\n",
        "\n",
        "        self.weight.data.zero_()\n",
        "        if self.padding_idx is not None:\n",
        "            self.weight.data[self.padding_idx].fill_(0)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ktXRW3-4UUGt",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "class DotModel(nn.Module):\n",
        "    \n",
        "    def __init__(self,\n",
        "                 num_users,\n",
        "                 num_items,\n",
        "                 embedding_dim=32):\n",
        "        \n",
        "        super(DotModel, self).__init__()\n",
        "        \n",
        "        self.embedding_dim = embedding_dim\n",
        "        \n",
        "        self.user_embeddings = ScaledEmbedding(num_users, embedding_dim)\n",
        "        self.item_embeddings = ScaledEmbedding(num_items, embedding_dim)\n",
        "        self.user_biases = ZeroEmbedding(num_users, 1)\n",
        "        self.item_biases = ZeroEmbedding(num_items, 1)\n",
        "                \n",
        "        \n",
        "    def forward(self, user_ids, item_ids):\n",
        "        \n",
        "        user_embedding = self.user_embeddings(user_ids)\n",
        "        item_embedding = self.item_embeddings(item_ids)\n",
        "\n",
        "        user_embedding = user_embedding.squeeze()\n",
        "        item_embedding = item_embedding.squeeze()\n",
        "\n",
        "        user_bias = self.user_biases(user_ids).squeeze()\n",
        "        item_bias = self.item_biases(item_ids).squeeze()\n",
        "\n",
        "        dot = (user_embedding * item_embedding).sum(1)\n",
        "\n",
        "        return dot + user_bias + item_bias\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Lr_nC6eSUUGx",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def gpu(tensor, gpu=False):\n",
        "\n",
        "    if gpu:\n",
        "        return tensor.cuda()\n",
        "    else:\n",
        "        return tensor\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4hGbjMgUgNsB",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "use_cuda = torch.cuda.is_available()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "RLwD8cOthZ2D",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "use_cuda"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "PlbbwU1ze5d7",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "net = gpu(DotModel(total_user_id,total_item_id), use_cuda)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ysX9Q9pxiMG4",
        "colab_type": "text"
      },
      "source": [
        "Now test your network on a small batch."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Cp2ZOCFhfERq",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "batch_users_np = user_id_train.values[:5].astype(np.int32)\n",
        "batch_items_np = item_id_train.values[:5].astype(np.int32)\n",
        "batch_ratings_np = rating_train[:5].values.astype(np.float32)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "JzrP3hVcf1uY",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "batch_users_tensor = gpu(torch.LongTensor(batch_users_np), use_cuda)\n",
        "batch_items_tensor = gpu(torch.LongTensor(batch_items_np), use_cuda)\n",
        "batch_ratings_tensor = gpu(torch.tensor(batch_ratings_np), use_cuda)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8K4btQb2gsAg",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "predicitions = net(batch_users_tensor,batch_items_tensor)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "I20UhBLZjBEs",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "predicitions"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZGG3SBS9jizs",
        "colab_type": "text"
      },
      "source": [
        "We will use MSE loss defined below:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7sqlTLCyiwzl",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def regression_loss(predicted_ratings, observed_ratings):\n",
        "    return ((observed_ratings - predicted_ratings) ** 2).mean()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "wK5LqsRXi0_G",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "loss_fn = regression_loss\n",
        "loss = loss_fn(predicitions, batch_ratings_tensor)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Ins38kThjNl0",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "loss"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MPpfdFaikudZ",
        "colab_type": "text"
      },
      "source": [
        "Check that your network is learning by overfitting your network on this small batch (you should reach a loss below 0.5 in the cell below)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TcIY0t-xkFwT",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "net = gpu(DotModel(total_user_id,total_item_id), use_cuda)\n",
        "optimizer = torch.optim.Adam(net.parameters(), lr = 0.1)\n",
        "for e in range(10):\n",
        "  predicitions = net(batch_users_tensor,batch_items_tensor)\n",
        "  loss = loss_fn(predicitions, batch_ratings_tensor)\n",
        "  print(loss)\n",
        "  optimizer.zero_grad()\n",
        "  loss.backward()\n",
        "  optimizer.step()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CYeBlERRUUG0",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def shuffle(*arrays):\n",
        "\n",
        "    random_state = np.random.RandomState()\n",
        "    shuffle_indices = np.arange(len(arrays[0]))\n",
        "    random_state.shuffle(shuffle_indices)\n",
        "\n",
        "    if len(arrays) == 1:\n",
        "        return arrays[0][shuffle_indices]\n",
        "    else:\n",
        "        return tuple(x[shuffle_indices] for x in arrays)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "OmlQTG2FUUG4",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def minibatch(batch_size, *tensors):\n",
        "\n",
        "    if len(tensors) == 1:\n",
        "        tensor = tensors[0]\n",
        "        for i in range(0, len(tensor), batch_size):\n",
        "            yield tensor[i:i + batch_size]\n",
        "    else:\n",
        "        for i in range(0, len(tensors[0]), batch_size):\n",
        "            yield tuple(x[i:i + batch_size] for x in tensors)\n",
        "\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "2ekPPr7SUUG7",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import imp\n",
        "import numpy as np\n",
        "\n",
        "import torch.optim as optim\n",
        "\n",
        "class FactorizationModel(object):\n",
        "    \n",
        "    def __init__(self,\n",
        "                 embedding_dim=32,\n",
        "                 n_iter=10,\n",
        "                 batch_size=256,\n",
        "                 l2=0.0,\n",
        "                 learning_rate=1e-2,\n",
        "                 use_cuda=False,\n",
        "                 net=None,\n",
        "                 num_users=None,\n",
        "                 num_items=None, \n",
        "                 random_state=None):\n",
        "        \n",
        "        self._embedding_dim = embedding_dim\n",
        "        self._n_iter = n_iter\n",
        "        self._learning_rate = learning_rate\n",
        "        self._batch_size = batch_size\n",
        "        self._l2 = l2\n",
        "        self._use_cuda = use_cuda\n",
        "        \n",
        "        self._num_users = num_users\n",
        "        self._num_items = num_items\n",
        "        self._net = net\n",
        "        self._optimizer = None\n",
        "        self._loss_func = None\n",
        "        self._random_state = random_state or np.random.RandomState()\n",
        "             \n",
        "        \n",
        "    def _initialize(self):\n",
        "        if self._net is None:\n",
        "            self._net = gpu(DotModel(self._num_users, self._num_items, self._embedding_dim),self._use_cuda)\n",
        "        \n",
        "        self._optimizer = optim.Adam(\n",
        "                self._net.parameters(),\n",
        "                lr=self._learning_rate,\n",
        "                weight_decay=self._l2\n",
        "            )\n",
        "        \n",
        "        self._loss_func = regression_loss\n",
        "    \n",
        "    @property\n",
        "    def _initialized(self):\n",
        "        return self._optimizer is not None\n",
        "    \n",
        "    def __repr__(self):\n",
        "        return _repr_model(self)\n",
        "    \n",
        "    def fit(self, user_ids, item_ids, ratings, verbose=True):\n",
        "        \n",
        "        user_ids = user_ids.astype(np.int64)\n",
        "        item_ids = item_ids.astype(np.int64)\n",
        "        \n",
        "        if not self._initialized:\n",
        "            self._initialize()\n",
        "            \n",
        "        for epoch_num in range(self._n_iter):\n",
        "            users, items, ratingss = shuffle(user_ids,\n",
        "                                            item_ids,\n",
        "                                            ratings)\n",
        "\n",
        "            user_ids_tensor = gpu(torch.from_numpy(users),\n",
        "                                  self._use_cuda)\n",
        "            item_ids_tensor = gpu(torch.from_numpy(items),\n",
        "                                  self._use_cuda)\n",
        "            ratings_tensor = gpu(torch.from_numpy(ratingss),\n",
        "                                 self._use_cuda)\n",
        "            epoch_loss = 0.0\n",
        "\n",
        "            for (minibatch_num,\n",
        "                 (batch_user,\n",
        "                  batch_item,\n",
        "                  batch_rating)) in enumerate(minibatch(self._batch_size,\n",
        "                                                         user_ids_tensor,\n",
        "                                                         item_ids_tensor,\n",
        "                                                         ratings_tensor)):\n",
        "                \n",
        "                \n",
        "        \n",
        "                predictions = self._net(batch_user, batch_item)\n",
        "\n",
        "                self._optimizer.zero_grad()\n",
        "                \n",
        "                loss = self._loss_func(predictions,batch_rating)\n",
        "                \n",
        "                epoch_loss = epoch_loss + loss.data.item()\n",
        "                \n",
        "                loss.backward()\n",
        "                self._optimizer.step()\n",
        "                \n",
        "            \n",
        "            epoch_loss = epoch_loss / (minibatch_num + 1)\n",
        "\n",
        "            if verbose:\n",
        "                print('Epoch {}: loss {}'.format(epoch_num, epoch_loss))\n",
        "        \n",
        "            if np.isnan(epoch_loss) or epoch_loss == 0.0:\n",
        "                raise ValueError('Degenerate epoch loss: {}'\n",
        "                                 .format(epoch_loss))\n",
        "    \n",
        "    \n",
        "    def test(self,user_ids, item_ids, ratings):\n",
        "        self._net.train(False)\n",
        "        user_ids = user_ids.astype(np.int64)\n",
        "        item_ids = item_ids.astype(np.int64)\n",
        "        \n",
        "        user_ids_tensor = gpu(torch.from_numpy(user_ids),\n",
        "                                  self._use_cuda)\n",
        "        item_ids_tensor = gpu(torch.from_numpy(item_ids),\n",
        "                                  self._use_cuda)\n",
        "        ratings_tensor = gpu(torch.from_numpy(ratings),\n",
        "                                 self._use_cuda)\n",
        "               \n",
        "        predictions = self._net(user_ids_tensor, item_ids_tensor)\n",
        "        \n",
        "        loss = self._loss_func(ratings_tensor, predictions)\n",
        "        return loss.data.item()\n",
        "      \n",
        "    def predict(self,user_ids, item_ids):\n",
        "        self._net.train(False)\n",
        "        user_ids = user_ids.astype(np.int64)\n",
        "        item_ids = item_ids.astype(np.int64)\n",
        "        \n",
        "        user_ids_tensor = gpu(torch.from_numpy(user_ids),\n",
        "                                  self._use_cuda)\n",
        "        item_ids_tensor = gpu(torch.from_numpy(item_ids),\n",
        "                                  self._use_cuda)\n",
        "               \n",
        "        predictions = self._net(user_ids_tensor, item_ids_tensor)\n",
        "        return predictions.data\n",
        "      "
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qGuzPUNrUUG_",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "model = FactorizationModel(embedding_dim=128,  # latent dimensionality\n",
        "                                   n_iter=10,  # number of epochs of training\n",
        "                                   batch_size=1024,  # minibatch size\n",
        "                                   learning_rate=1e-3,\n",
        "                                   l2=1e-9,  # strength of L2 regularization\n",
        "                                   use_cuda=torch.cuda.is_available(),\n",
        "                                   num_users=total_user_id+1,\n",
        "                                   num_items=total_item_id+1)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mMKyTn4QUUHC",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "user_ids_train_np = user_id_train.values.astype(np.int32)\n",
        "item_ids_train_np = item_id_train.values.astype(np.int32)\n",
        "ratings_train_np = rating_train.values.astype(np.float32)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "2Bd2CIosUUHJ",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "model.fit(user_ids_train_np, item_ids_train_np, ratings_train_np)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "W9BIV9WaUUHN",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "print(model._net)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "HKyc7Bg5UUHR",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "user_ids_test_np = user_id_test.values.astype(np.int64)\n",
        "item_ids_test_np = item_id_test.values.astype(np.int64)\n",
        "ratings_test_np = rating_test.values.astype(np.float32)\n",
        "model.test(user_ids_test_np, item_ids_test_np, ratings_test_np  )"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Ltvw1o_dVjbm",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from sklearn.metrics import mean_squared_error\n",
        "from sklearn.metrics import mean_absolute_error\n",
        "\n",
        "test_preds = model.predict(user_ids_test_np, item_ids_test_np)\n",
        "print(\"Final test RMSE: %0.3f\" % np.sqrt(mean_squared_error(test_preds.cpu(), ratings_test_np)))\n",
        "print(\"Final test MAE: %0.3f\" % mean_absolute_error(test_preds.cpu(), ratings_test_np))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yXVIF75IUUHW",
        "colab_type": "text"
      },
      "source": [
        "You can compare with [Surprise](https://github.com/NicolasHug/Surprise)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MwQ3gXyuUUHY",
        "colab_type": "text"
      },
      "source": [
        "## 4. Best and worst movies"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RxwaryZZUUHY",
        "colab_type": "text"
      },
      "source": [
        "Getting the name of the movies (there must be a better way, please provide alternate solutions!)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "I-hyp9IpUUHa",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "df = pd.read_csv(op.join(ML_100K_FOLDER, 'u.item'), sep='|', names=['item_id', 'item_name','date','','','','','','','','','','','','','','','','','','','','',''],encoding = \"ISO-8859-1\")\n",
        "movies_names = df.loc[:,['item_id', 'item_name']]\n",
        "movies_names = movies_names.set_index(['item_id'])\n",
        "movies_names.head()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "B8aAYBuhUUHf",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "item_bias_np = model._net.item_biases.weight.data.cpu().numpy()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZhtanoCiUUHi",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "movies_names['biases'] = pd.Series(item_bias_np[1:].T[0], index=movies_names.index)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "2TAx4Tz6UUHk",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "movies_names.head()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "gn8gcmQjUUHo",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "movies_names.shape"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Q6ZmJYIdUUHt",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "indices_item_train = np.sort(item_id_train.unique())\n",
        "movies_names = movies_names.loc[indices_item_train]\n",
        "movies_names.shape"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "iUqELsLPUUH1",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "movies_names = movies_names.sort_values(ascending=False,by=['biases'])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7TG247DXUUH4",
        "colab_type": "text"
      },
      "source": [
        "Best movies"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZzQGi5VIUUH5",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "movies_names.head(10)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Y3fHTO-eUUH_",
        "colab_type": "text"
      },
      "source": [
        "Worse movies"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1-EhFYOwUUIA",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "movies_names.tail(10)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XwOYSxcZUUII",
        "colab_type": "text"
      },
      "source": [
        "## 5. SPOTLIGHT\n",
        "\n",
        "The code written above is a simplified version of [SPOTLIGHT](https://github.com/maciejkula/spotlight)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-nPXmJ_nUUIJ",
        "colab_type": "text"
      },
      "source": [
        "Once you installed it with: `conda install -c maciejkula -c pytorch spotlight=0.1.5`, you can compare the results..."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aCEqMg9WUUIK",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}